massively multilingual
CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset
Yan, Brian, Hamed, Injy, Shimizu, Shuichiro, Lodagala, Vasista, Chen, William, Iakovenko, Olga, Talafha, Bashar, Hussein, Amir, Polok, Alexander, Chang, Kalvin, Klement, Dominik, Althubaiti, Sara, Peng, Puyuan, Wiesner, Matthew, Solorio, Thamar, Ali, Ahmed, Khudanpur, Sanjeev, Watanabe, Shinji, Chen, Chih-Chen, Wu, Zhen, Benharrak, Karim, Diwan, Anuj, Cornell, Samuele, Yeo, Eunjung, Choi, Kwanghee, Carvalho, Carlos, Rosero, Karen
CS-FLEURS consists of four test sets that together cover 113 unique code-switched language pairs across 52 languages: 1) a set of 14 X-English language pairs with real voices reading synthetically generated code-switched sentences, 2) a set of 16 X-English language pairs with generative text-to-speech, 3) a set of 60 {Arabic, Mandarin, Hindi, Spanish}-X language pairs with generative text-to-speech, and 4) a test set of 45 lower-resourced X-English language pairs with concatenative text-to-speech. Besides the four test sets, CS-FLEURS also provides a training set with 128 hours of generative text-to-speech data across 16 X-English language pairs. Our hope is that CS-FLEURS helps to broaden the scope of future code-switched speech research.
SeamlessM4T: Massively Multilingual & Multimodal Machine Translation
Communication, Seamless, Barrault, Loïc, Chung, Yu-An, Meglioli, Mariano Cora, Dale, David, Dong, Ning, Duquenne, Paul-Ambroise, Elsahar, Hady, Gong, Hongyu, Heffernan, Kevin, Hoffman, John, Klaiber, Christopher, Li, Pengwei, Licht, Daniel, Maillard, Jean, Rakotoarison, Alice, Sadagopan, Kaushik Ram, Wenzek, Guillaume, Ye, Ethan, Akula, Bapi, Chen, Peng-Jen, Hachem, Naji El, Ellis, Brian, Gonzalez, Gabriel Mejia, Haaheim, Justin, Hansanti, Prangthip, Howes, Russ, Huang, Bernie, Hwang, Min-Jae, Inaguma, Hirofumi, Jain, Somya, Kalbassi, Elahe, Kallet, Amanda, Kulikov, Ilia, Lam, Janice, Li, Daniel, Ma, Xutai, Mavlyutov, Ruslan, Peloquin, Benjamin, Ramadan, Mohamed, Ramakrishnan, Abinesh, Sun, Anna, Tran, Kevin, Tran, Tuan, Tufanov, Igor, Vogeti, Vish, Wood, Carleigh, Yang, Yilin, Yu, Bokai, Andrews, Pierre, Balioglu, Can, Costa-jussà, Marta R., Celebi, Onur, Elbayad, Maha, Gao, Cynthia, Guzmán, Francisco, Kao, Justine, Lee, Ann, Mourachko, Alexandre, Pino, Juan, Popuri, Sravya, Ropers, Christophe, Saleem, Safiyyah, Schwenk, Holger, Tomasello, Paden, Wang, Changhan, Wang, Jeff, Wang, Skyler
What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. By filtering this corpus and combining it with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. In robustness tests, our system handles background noise and speaker variation in speech-to-text tasks better than the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication